Python dataframes with pandas and polars

Andreas Beger and Isaac Chung
PyData Tallinn x Python CodeClub
27 November 2024

Bios

Andreas Beger

  • 🏢 Data Scientist, Consult.
  • 🏃‍♂️🐌 Slow marathoner
  • 📍 🇩🇪/🇭🇷 → 🇺🇸 → 🇪🇪
  • 🎓 PhD Political Science

Isaac Chung

  • 🏢 Staff Data Scientist, Wrike
  • 🏊‍♂️🚴🏃‍♂️ Fast triathlete
  • 📍 🇭🇰 → 🇨🇦 → 🇪🇪
  • 🎓 MS Machine Learning

🐍 We are also the PyData Tallinn co-organizers.

Getting set up

Instructions for how to follow along in notebooks…GitHub codespaces?

What are dataframes?

Definition

  • Dataframes are a data type representing 2D tables
  • Where the columns have names
  • Unlike matrices or arrays, columns might have different data types
  • And the rows are identified by one or more ID variables
x  y  group
1  2  a
4  7  b
3  8  a
9  2  b

Why?

Imagine working with tabular data if we didn’t have dataframes and associated methods.

by_rows = [
    {"x": 1, "y": 2, "group": "a"},
    {"x": 4, "y": 7, "group": "b"},
    {"x": 3, "y": 8, "group": "a"},
    {"x": 9, "y": 2, "group": "b"}
]


by_columns = {
    "x": [1, 4, 3, 9],
    "y": [2, 7, 8, 2],
    "group": ["a", "b", "a", "b"]
}
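To make the point concrete, here is a minimal sketch (using the toy rows above) of a group-and-sum done by hand versus the one-line dataframe equivalent:

```python
import pandas as pd

by_rows = [
    {"x": 1, "y": 2, "group": "a"},
    {"x": 4, "y": 7, "group": "b"},
    {"x": 3, "y": 8, "group": "a"},
    {"x": 9, "y": 2, "group": "b"},
]

# Without dataframes: manual bookkeeping for every operation
totals = {}
for row in by_rows:
    totals[row["group"]] = totals.get(row["group"], 0) + row["y"]
print(totals)  # {'a': 10, 'b': 9}

# With a dataframe: the intent fits in a single expression
df = pd.DataFrame(by_rows)
print(df.groupby("group")["y"].sum().to_dict())  # {'a': 10, 'b': 9}
```

Every operation in the manual version (filtering, joining, reshaping) would need similar bespoke loops; dataframes package them as reusable methods.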

Common dataframe operations

  • 📖 ✍️ read and write
  • 🔬 inspect
  • 🛒 select columns
  • 🔍 filter rows
  • 🥪 mutate, add columns
  • 👨‍👩‍👧‍👦 group and aggregate
  • 🤝 join other dataframes
  • 🧱 reshape wide, long

Agenda

  1. Pandas
    • Basic operations through group/aggregate
  2. Polars
    • Expressions
    • Joining dataframes and reshaping / pivoting
  3. Big picture
    • Pandas and Polars pros and cons
    • Some other frameworks

Section 1: pandas

History

  • Created by Wes McKinney in 2008
  • Originally built on top of NumPy
  • pandas 2 adds support for an Arrow backend

Getting started

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "quarter": [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    "x": np.random.randn(12),
    "date": pd.date_range("2024-01-01", periods=12, freq="MS")
})

df.head()
   quarter         x       date
0        1 -0.002226 2024-01-01
1        1 -0.196338 2024-02-01
2        1  0.207700 2024-03-01
3        2  0.931805 2024-04-01
4        2  0.579740 2024-05-01

Components of a dataframe

Series

df.x
0    -0.002226
1    -0.196338
2     0.207700
3     0.931805
4     0.579740
5    -1.221461
6    -0.834020
7     0.905148
8    -0.131270
9     1.788473
10   -0.793744
11   -0.376970
Name: x, dtype: float64

Columns

df.columns
Index(['quarter', 'x', 'date'], dtype='object')

Index

df.index
RangeIndex(start=0, stop=12, step=1)
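The index is what labels the rows. A small sketch with made-up data (a shortened version of the df above) showing how promoting a column to the index changes row lookup:

```python
import pandas as pd

df = pd.DataFrame({
    "quarter": [1, 1, 2],
    "x": [0.1, -0.2, 0.3],
    "date": pd.date_range("2024-01-01", periods=3, freq="MS"),
})

# By default, rows are labeled by a RangeIndex (0, 1, 2, ...)
print(df.index)  # RangeIndex(start=0, stop=3, step=1)

# Any column can be promoted to the index, changing how .loc finds rows
by_date = df.set_index("date")
print(by_date.loc[pd.Timestamp("2024-02-01"), "x"])  # -0.2
```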

Input - reading data

accidents = pd.read_csv("data/estonia-traffic-accidents-clean.csv")

Inspecting

accidents.shape
(14259, 8)
accidents.columns
Index(['date', 'persons_involved', 'killed', 'injured', 'county',
       'pedestrian_involved', 'accident_type', 'light_conditions'],
      dtype='object')
accidents.head()
                  date  persons_involved  killed  injured         county
0  2014-10-24 08:45:00                 2       0        1  Harju maakond
1  2014-10-24 13:45:00                 2       0        1  Harju maakond
2  2014-08-11 00:00:00                 2       0        1  Harju maakond
3  2014-11-17 17:32:00                 2       0        2  Harju maakond
4  2015-04-28 07:55:00                 2       0        1  Harju maakond

Inspecting

accidents.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14259 entries, 0 to 14258
Data columns (total 8 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   date                 14259 non-null  object
 1   persons_involved     14259 non-null  int64 
 2   killed               14259 non-null  int64 
 3   injured              14259 non-null  int64 
 4   county               14259 non-null  object
 5   pedestrian_involved  14259 non-null  int64 
 6   accident_type        14259 non-null  object
 7   light_conditions     14259 non-null  object
dtypes: int64(4), object(4)
memory usage: 891.3+ KB

Selecting columns

Different ways, one is indexing with []:

accidents["date"].head(4)
0    2014-10-24 08:45:00
1    2014-10-24 13:45:00
2    2014-08-11 00:00:00
3    2014-11-17 17:32:00
Name: date, dtype: object

Multiple columns

accidents[["date", "county"]].head(4)
                  date         county
0  2014-10-24 08:45:00  Harju maakond
1  2014-10-24 13:45:00  Harju maakond
2  2014-08-11 00:00:00  Harju maakond
3  2014-11-17 17:32:00  Harju maakond

Mutating columns

Right now date is stored as a string:

accidents["date"][0]
'2014-10-24 08:45:00'
type(accidents["date"][0])
str


Convert it to a proper data type:

accidents["date"] = pd.to_datetime(accidents["date"])
type(accidents["date"][0])
pandas._libs.tslibs.timestamps.Timestamp
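Once the column is a proper datetime, the .dt accessor unlocks date operations. A quick sketch using two of the dates from the data above:

```python
import pandas as pd

s = pd.to_datetime(pd.Series(["2014-10-24 08:45:00", "2015-04-28 07:55:00"]))

# The .dt accessor exposes datetime components and methods on the whole column
print(s.dt.year.tolist())        # [2014, 2015]
print(s.dt.day_name().tolist())  # ['Friday', 'Tuesday']
```

None of this works while the column is still stored as strings.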

Filtering rows

accidents[accidents["county"] == "Harju maakond"].shape
(7000, 8)


accidents["county"] == "Harju maakond"
0         True
1         True
2         True
3         True
4         True
         ...  
14254    False
14255    False
14256     True
14257    False
14258    False
Name: county, Length: 14259, dtype: bool
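Boolean masks compose: conditions combine with & (and) and | (or), each wrapped in its own parentheses. A minimal sketch with made-up rows standing in for the accidents data:

```python
import pandas as pd

accidents = pd.DataFrame({
    "county": ["Harju maakond", "Tartu maakond", "Harju maakond"],
    "killed": [0, 1, 2],
})

# Each condition produces a boolean Series; combine them element-wise with &
mask = (accidents["county"] == "Harju maakond") & (accidents["killed"] > 0)
print(accidents[mask])  # only rows where both conditions hold
```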

Mutating dataframes

accidents["killed_or_injured"] = accidents["killed"] + accidents["injured"]
accidents[['killed', 'injured', 'killed_or_injured']].head()
   killed  injured  killed_or_injured
0       0        1                  1
1       0        1                  1
2       0        1                  1
3       0        2                  2
4       0        1                  1


accidents["killed_or_injured"].sum()
np.int64(18021)

Grouping and summarizing

How many people were harmed, by accident type?

by_type = accidents.groupby("accident_type").agg({"killed_or_injured": "sum"})
by_type
                    killed_or_injured
accident_type
Jalakäijaõnnetus                 3548
Kokkupõrge                       7951
Muu liiklusõnnetus               1436
Teadmata                           70
Ühesõidukiõnnetus                5016
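agg() also accepts several statistics per column at once. A sketch with made-up data in place of the real accidents frame:

```python
import pandas as pd

accidents = pd.DataFrame({
    "accident_type": ["a", "a", "b", "b"],
    "killed_or_injured": [1, 2, 3, 4],
})

# Pass a list of statistics to get one output column per statistic
summary = (accidents
           .groupby("accident_type")["killed_or_injured"]
           .agg(["sum", "mean", "max"]))
print(summary["sum"].to_dict())  # {'a': 3, 'b': 7}
```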

pandas is great


2017, Wes McKinney (creator of pandas):

10 Things I Hate About Pandas

  • Inefficient memory management: may need 5-10x the dataset size in RAM
  • Eager evaluation → limited query planning
  • No multi-core execution

Section 2: polars

History

  • Created by Ritchie Vink in 2020
  • Uses Apache Arrow as its internal memory representation
  • (Arrow itself was co-created by Wes McKinney in 2016!)

What changes with polars

  • Out with indices
  • Out with .loc, .iloc
  • Out with [] indexing
  • In with lazy evaluation
  • Expressions

Getting started

import polars as pl

accidents = pl.read_csv("data/estonia-traffic-accidents-clean.csv")
accidents.head()
shape: (5, 5)
date                   persons_involved  killed  injured  county
str                    i64               i64     i64      str
"2014-10-24 08:45:00"  2                 0       1        "Harju maakond"
"2014-10-24 13:45:00"  2                 0       1        "Harju maakond"
"2014-08-11 00:00:00"  2                 0       1        "Harju maakond"
"2014-11-17 17:32:00"  2                 0       2        "Harju maakond"
"2015-04-28 07:55:00"  2                 0       1        "Harju maakond"

Easy to convert between the two

df = df.to_pandas()
df = pl.from_pandas(df)

Selecting columns

accidents.select("date", "county").head()
shape: (5, 2)
date                   county
str                    str
"2014-10-24 08:45:00"  "Harju maakond"
"2014-10-24 13:45:00"  "Harju maakond"
"2014-08-11 00:00:00"  "Harju maakond"
"2014-11-17 17:32:00"  "Harju maakond"
"2015-04-28 07:55:00"  "Harju maakond"

Expressions

Expressions are abstract, composable data transformations that are executed within a context that provides the data.

accidents.select(pl.col("date")).head()
shape: (5, 1)
date
str
"2014-10-24 08:45:00"
"2014-10-24 13:45:00"
"2014-08-11 00:00:00"
"2014-11-17 17:32:00"
"2015-04-28 07:55:00"

They can be composed

What was the biggest accident, in terms of people killed or injured?

accidents.select(
    # select 'killed'
    pl.col("killed")
    # add 'injured'
    .add(pl.col("injured"))
    # give the result a new column name
    .alias("killed_or_injured")
    # identify the max value
    .max())
shape: (1, 1)
killed_or_injured
i64
23

And they work in multiple contexts

  • select()
  • filter()
  • with_columns(): mutating dataframes
  • group_by() and aggregations

Filtering rows

How many accidents were in Harju county?

accidents.filter(pl.col("county").eq("Harju maakond")).shape
(7000, 8)
accidents.filter(pl.col("county").str.contains("Harju")).shape
(7000, 8)

Mutating dataframes

with_columns() + expressions

accidents = accidents.with_columns(
    pl.col("killed").add(pl.col("injured")).alias("killed_or_injured"),
    pl.col("killed").add(pl.col("injured")).truediv(pl.col("persons_involved")).alias("harmed_rate")
)
accidents.select(["date", "persons_involved", "killed_or_injured", "harmed_rate"]).head(5)
shape: (5, 4)
date                   persons_involved  killed_or_injured  harmed_rate
str                    i64               i64                f64
"2014-10-24 08:45:00"  2                 1                  0.5
"2014-10-24 13:45:00"  2                 1                  0.5
"2014-08-11 00:00:00"  2                 1                  0.5
"2014-11-17 17:32:00"  2                 2                  1.0
"2015-04-28 07:55:00"  2                 1                  0.5

Group and summarize/aggregate

group_by() + agg() or with_columns()

by_county = (accidents
             .group_by("county")
             .agg(pl.col("killed_or_injured").sum())
             .sort("killed_or_injured", descending=True)
)
by_county.head()
shape: (5, 2)
county                killed_or_injured
str                   i64
"Harju maakond"       8423
"Tartu maakond"       1968
"Ida-Viru maakond"    1348
"Pärnu maakond"       1293
"Lääne-Viru maakond"  883

(Optional) Joining dataframes

What’s the per capita accident victim rate?

county_pop = (pl.read_csv("data/county-pop.csv", skip_rows=2)
              .rename({"County": "county", "Age groups total": "population"})
              .select(["county", "population"])
              # this has "county" in the county names, not "maakond"
              .with_columns(pl.col("county").str.replace("county", "maakond"))
              )

by_county_w_pop = by_county.join(county_pop, on="county", how="left")
by_county_w_pop.head(3)
shape: (3, 3)
county              killed_or_injured  population
str                 i64                i64
"Harju maakond"     8423               598059
"Tartu maakond"     1968               152977
"Ida-Viru maakond"  1348               136240

(Optional) Joining dataframes

Now we can use a simple select + expressions to do the math:

by_county_w_pop.select(
    pl.col("county"), 
    pl.col("killed_or_injured"),
    pl.col("killed_or_injured").truediv(pl.col("population")).mul(1000).alias("rate/1000")
    ).head(3)
shape: (3, 3)
county              killed_or_injured  rate/1000
str                 i64                f64
"Harju maakond"     8423               14.083895
"Tartu maakond"     1968               12.864679
"Ida-Viru maakond"  1348               9.894304

(Optional) Reshaping / pivoting dataframes

We’re going to use a different dataset on reflector usage for this.

reflectors = (pl.read_csv("data/reflectors.csv", has_header=True, separator=";", skip_rows=2)
              .filter(pl.col("Sex").ne("Men and women"))
              .drop(["Type of data", "Year", "All age groups (16-64)"])
)
reflectors.head()
shape: (5, 7)
Reflector use    Sex      16-24  25-34  35-44  45-54  55-64
str              str      f64    f64    f64    f64    f64
"Never"          "Men"    14.3   12.4   7.2    3.9    2.7
"Never"          "Women"  8.8    5.0    4.6    2.0    2.5
"Sometimes"      "Men"    46.7   36.2   30.9   26.1   28.7
"Sometimes"      "Women"  29.6   26.0   20.6   14.8   13.7
"Nearly always"  "Men"    34.3   40.5   52.2   58.6   55.9

(Optional) Reshaping / pivoting dataframes

reflectors = (reflectors
              .unpivot(index=["Reflector use", "Sex"], 
                       variable_name="age_group", 
                       value_name="percentage")
)
reflectors.head()
shape: (5, 4)
Reflector use    Sex      age_group  percentage
str              str      str        f64
"Never"          "Men"    "16-24"    14.3
"Never"          "Women"  "16-24"    8.8
"Sometimes"      "Men"    "16-24"    46.7
"Sometimes"      "Women"  "16-24"    29.6
"Nearly always"  "Men"    "16-24"    34.3

Plot reflector use by age and gender

(reflectors
 .filter(pl.col("Reflector use").eq("Never"))
 .plot.line(x = "age_group", y = "percentage", color = "Sex")
 .properties(width=700, height=300)
)

Modified plot

One category is “Never walk on dark streets, roads”…🧐

(reflectors
 .with_columns(pl.col("Reflector use").str.replace("Never walk on dark streets, roads", "Never"))
 .group_by(["Reflector use", "Sex", "age_group"])
 .agg(pl.col("percentage").sum())
 .filter(pl.col("Reflector use").eq("Never"))
 .sort(["age_group", "Sex"])
 .plot.line(x = "age_group", y = "percentage", color = "Sex")
 .properties(width=700, height=300)
)

Why you should plot your data 😼

df1 = pl.read_csv("data/dataset1.csv")
df1.shape
(142, 2)
stats = ["mean", "std", "25%", "75%"]
(df1
 .describe()
 .filter(pl.col("statistic").is_in(stats))
)
shape: (4, 3)
statistic  x          y
str        f64        f64
"mean"     54.2661    47.834721
"std"      16.769825  26.939743
"25%"      39.706326  24.46783
"75%"      69.359559  71.806616
df2 = pl.read_csv("data/dataset2.csv")
df2.shape
(142, 2)
stats = ["mean", "std", "25%", "75%"]
(df2
 .describe()
 .filter(pl.col("statistic").is_in(stats))
)
shape: (4, 3)
statistic  x          y
str        f64        f64
"mean"     54.263273  47.832253
"std"      16.765142  26.935403
"25%"      44.1026    25.2564
"75%"      64.8718    69.1026

Why you should plot your data, pt. 2

df1.plot.point("x", "y")
df2.plot.point("x", "y")

🦖🦖🦖🦖🦖🦖🦖🦖🦖🦖🦖🦖🦖

The big picture

Andy is a polars stan

Comparison

pandas

  • ✅ Very widely used and supported
  • ✅ Stable
  • ❓ More imperative, traditional API
  • ❌ Inconsistent API, multiple ways of doing the same thing

polars

  • ✅ More consistent, functional-style API
  • ✅ Faster, smaller memory footprint
  • ✅ Works with larger-than-memory datasets out of the box
  • ❌ API still changing

Other frameworks

Thank you!

Scan this and let us know how we did 🤗